Dataset: Medical Information Mart for Intensive Care-IV (MIMIC-IV)

Privacy-preserving large language models for structured medical information retrieval

NLP Tasks: Information Extraction

Method: an open-source pipeline built on the locally deployed large language model (LLM) Llama 2

Metrics:

  • Sensitivity (100%) and specificity (96%)
  • Ascites: 95% sensitivity, 95% specificity
  • Confusion: 76% sensitivity, 94% specificity
  • Abdominal pain: 84% sensitivity, 97% specificity
  • Shortness of breath: 87% sensitivity, 97% specificity
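
A minimal sketch of this kind of local, privacy-preserving extraction step, assuming a GGUF build of Llama 2 served through llama-cpp-python; the model path, prompt wording, and yes/no parsing are illustrative, not the authors' exact setup.

```python
# Sketch: the model runs locally, so no clinical text leaves the machine.
# Assumes `pip install llama-cpp-python` and a local GGUF checkpoint of
# Llama 2 (the path below is a placeholder).
from llama_cpp import Llama

SYMPTOMS = ["ascites", "confusion", "abdominal pain", "shortness of breath"]

llm = Llama(model_path="./llama-2-13b-chat.Q4_K_M.gguf", n_ctx=4096, verbose=False)

def extract_symptoms(note: str) -> dict[str, bool]:
    """Ask the local model one yes/no question per symptom and parse the answer."""
    results = {}
    for symptom in SYMPTOMS:
        prompt = (
            "You are reading a hospital discharge summary.\n"
            f"Summary:\n{note}\n\n"
            f"Does the patient have {symptom}? Answer only 'yes' or 'no'.\nAnswer:"
        )
        out = llm(prompt, max_tokens=3, temperature=0.0)
        answer = out["choices"][0]["text"].strip().lower()
        results[symptom] = answer.startswith("yes")
    return results

print(extract_symptoms("... worsening ascites, denies abdominal pain ..."))
```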

Racial, ethnic, and sex bias in large language model opioid recommendations for pain management

NLP Tasks: Text Classification, Question Answering, Information Extraction

Method: instructing large language models (LLMs), specifically GPT-4 and Gemini, to provide subjective pain ratings and comprehensive pain management recommendations

Metrics:

  • Pain rating severity (OR: 0.57, 95% CI: [0.54, 0.60], P < 0.001)
  • Strong opioid recommendation (OR: 2.05, 95% CI: [1.59, 2.66], P < 0.001)
  • Timing of opioid recommendation (OR: 1.41, 95% CI: [1.22, 1.62], P < 0.001)
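
Odds ratios of this form can be estimated in outline with a logistic regression over the model's recommendations. Below is a hedged sketch using statsmodels; the data are synthetic stand-ins for the LLM outputs, and the group coding and variable names are illustrative.

```python
# Sketch: estimate the odds ratio of a strong-opioid recommendation for one
# patient group vs. a reference group. In the study, each row would be one
# LLM response to a demographically varied version of the same vignette;
# here the outcome is simulated.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({"group": rng.integers(0, 2, n)})  # 1 = comparison, 0 = reference

# Synthetic outcome: the comparison group gets strong opioids more often.
logit = -1.0 + 0.7 * df["group"]
df["strong_opioid"] = rng.random(n) < 1 / (1 + np.exp(-logit))

X = sm.add_constant(df[["group"]].astype(float))
model = sm.Logit(df["strong_opioid"].astype(float), X).fit(disp=False)

or_point = np.exp(model.params["group"])
or_ci = np.exp(model.conf_int().loc["group"])
print(f"OR = {or_point:.2f}, 95% CI = [{or_ci[0]:.2f}, {or_ci[1]:.2f}]")
```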

Automation of Trainable Datasets Generation for Medical-Specific Language Model: Using MIMIC-IV Discharge Notes

NLP Tasks: Text Generation

Method: a novel approach for automatically generating instruction datasets for fine-tuning medical-specialized language models, using MIMIC-IV discharge records

Metrics:

  • Mean ROUGE (0.185)
  • Validity rate by GPT-3.5 (88.0%)
  • Validity rate by human annotator (88.5%)
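
A sketch of the scoring side of such a pipeline: machine-generated responses are compared against references with ROUGE via the rouge-score package. The instruction/response pairs and the choice of ROUGE-L are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch: score machine-generated responses against references with ROUGE.
# Requires `pip install rouge-score`. The pairs below are placeholders for
# pairs derived from MIMIC-IV discharge notes.
from rouge_score import rouge_scorer

dataset = [
    {"instruction": "Summarize the hospital course.",
     "generated": "Patient admitted with pneumonia, treated with antibiotics, discharged stable.",
     "reference": "Admitted for pneumonia; improved on IV antibiotics; discharged in stable condition."},
    # ... more machine-generated pairs ...
]

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = [
    scorer.score(row["reference"], row["generated"])["rougeL"].fmeasure
    for row in dataset
]
print(f"Mean ROUGE-L F1: {sum(scores) / len(scores):.3f}")
```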

LCD Benchmark: Long Clinical Document Benchmark on Mortality Prediction for Language Models

NLP Tasks: Text Classification, Information Extraction, Question Answering

Method: LCD benchmark for predicting 30-day out-of-hospital mortality using discharge notes.

Metrics:

  • Accuracy (over 99%)
  • Accuracy (NAME: 10.2%, NAME+SYN: 36.1% with typos, NAME+SYN: 61.8% with typo-specific fine-tuning)
  • Accuracy (NAME: 11.2%, NAME+SYN: 92.7% for unseen synonyms)
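
For the benchmark's task (30-day out-of-hospital mortality from discharge notes), a bag-of-words classifier shows the general shape of a baseline; a sketch with scikit-learn, where the notes and labels are placeholders for the benchmark's actual splits.

```python
# Sketch: a simple baseline for 30-day out-of-hospital mortality prediction
# from discharge notes. Requires scikit-learn; notes and labels are toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline

train_notes = ["discharged home in stable condition ...",
               "metastatic disease, transitioned to hospice ..."]
train_labels = [0, 1]  # 1 = died within 30 days of discharge
test_notes, test_labels = train_notes, train_labels  # placeholder split

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                    LogisticRegression(max_iter=1000))
clf.fit(train_notes, train_labels)
probs = clf.predict_proba(test_notes)[:, 1]
print("AUROC:", roc_auc_score(test_labels, probs))
```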

Quantitative Evaluation of Large Language Models to Streamline Radiology Report Impressions: A Multimodal Retrospective Analysis

NLP Tasks: Text Generation, Text Classification, Question Answering

Method: Comparative analysis of four publicly available large language models (LLMs) for generating radiology report impressions
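
In outline, such a comparison generates an impression from each report's findings with every model, then scores it against the radiologist's impression. The sketch below uses ROUGE-L as the quantitative measure and a hypothetical generate_impression stub; neither is necessarily what the paper used.

```python
# Sketch: quantitative comparison of several LLMs on impression generation.
# `generate_impression` is a hypothetical stand-in for a per-model API call;
# scoring here uses ROUGE-L (the paper's exact metrics may differ).
from rouge_score import rouge_scorer

def generate_impression(model_name: str, findings: str) -> str:
    """Hypothetical placeholder for calling one of the compared models."""
    return "No acute cardiopulmonary abnormality."

cases = [
    {"findings": "Lungs are clear. Heart size normal. No effusion.",
     "impression": "No acute cardiopulmonary abnormality."},
]
models = ["model-a", "model-b", "model-c", "model-d"]

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
for model in models:
    f1s = [scorer.score(c["impression"],
                        generate_impression(model, c["findings"]))["rougeL"].fmeasure
           for c in cases]
    print(model, sum(f1s) / len(f1s))
```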

AnnoDash, a clinical terminology annotation dashboard

NLP Tasks: Information Extraction, Text Classification, Named Entity Recognition

Method: AnnoDash, a flexible dashboard to support annotation of concepts with terms from a given ontology.
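
The core of such a dashboard is ranking ontology terms as candidate annotations for a free-text concept; a minimal sketch using stdlib fuzzy matching against a toy term list (AnnoDash itself may rank candidates differently, and the terms shown are illustrative).

```python
# Sketch: rank candidate ontology terms for a free-text clinical concept.
# Uses stdlib difflib; a real dashboard could plug in any ontology
# (LOINC, SNOMED CT, ...) and any similarity model here.
import difflib

ONTOLOGY_TERMS = [
    "Hemoglobin [Mass/volume] in Blood",
    "Hematocrit [Volume Fraction] of Blood",
    "Heart rate",
    "Respiratory rate",
]

def rank_candidates(concept: str, terms: list[str], k: int = 3) -> list[tuple[str, float]]:
    """Return the top-k terms by string similarity to the concept."""
    scored = [(t, difflib.SequenceMatcher(None, concept.lower(), t.lower()).ratio())
              for t in terms]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

for term, score in rank_candidates("hgb blood", ONTOLOGY_TERMS):
    print(f"{score:.2f}  {term}")
```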

Evaluation and mitigation of the limitations of large language models in clinical decision-making

NLP Tasks: Information Extraction, Text Classification, Question Answering

Method: Creating a framework to simulate a realistic clinical setting using a curated dataset based on the Medical Information Mart for Intensive Care database
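
In outline, such a simulation lets the model request information stepwise instead of seeing the full record at once. Below is a sketch with a hypothetical ask_model stub standing in for the LLM and a toy curated record; the action protocol is an assumption, not the paper's framework.

```python
# Sketch: simulate a clinical setting where the model must request
# information (history, physical, labs, imaging) before committing to a
# diagnosis. `ask_model` is a hypothetical stand-in for an LLM call.
record = {
    "history": "45-year-old with right lower quadrant pain and fever.",
    "physical": "Tenderness at McBurney's point, guarding.",
    "labs": "WBC 14.2, CRP elevated.",
    "imaging": "CT abdomen: dilated appendix with fat stranding.",
}

def ask_model(transcript: str) -> str:
    """Hypothetical LLM call: returns the next action or a final diagnosis."""
    for action in ("history", "physical", "labs", "imaging"):
        if action not in transcript:
            return f"REQUEST {action}"
    return "DIAGNOSIS: acute appendicitis"

transcript = ""
while True:
    step = ask_model(transcript)
    if step.startswith("DIAGNOSIS"):
        print(step)
        break
    key = step.split()[1]
    transcript += f"{key}: {record[key]}\n"  # reveal only what was requested
```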
